Abstract

We are given M training samples of N-element column vectors in a matrix
X and a predefined constant $\lambda$. We want to compute the lower-triangular
matrix which is the Cholesky factor of $R = XX^{*} + \lambda I$ using highly parallel
hardware, either FPGAs or ASICs. Adding $\lambda I$ is called diagonal loading.
In most adaptive processing applications, diagonal loading is used to reduce
the sensitivity of the adaptation to errors due to insufficient sample support
and to slight errors in the target model.

Mathematically, we first prefix $\sqrt{\lambda}\,I$ to X and then apply N
Householder postmultiplications of size M+1, each carried out in a virtual
superprocessor. We format the computation so that each Householder operation
affects the same number of columns, but with fewer and fewer rows. Each actual
superprocessor shares the work of two virtual superprocessors. This allows all
superprocessors to be physically identical while each is used with 100%
efficiency. Data is moved from one superprocessor to another a row at a time.

*This work is sponsored by the Defense Advanced Research Projects Agency, under Air
Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and
recommendations are those of the author and are not necessarily endorsed by the
United States Government.

Most of the arithmetic operations in a Householder transformation are
simple in two important ways: they can be prescheduled, and they cannot
give unbounded results. Hence they are easily parallelized. These simple
operations fall into two groups. One group is dot products; the other
multiplies one row by a scalar and adds it to another row. Both kinds of
operation let data flow with each column’s elements moving only vertically
and keeping their order, a perfect recipe for systolic computation. Each
row is used as soon as it is transferred.

The  small  number  of  more  complicated  operations  we  need,  a  square 
root  and  a  few  multiplications  and  divisions,  are  carried  out  in  a  physically 
separate  part  of  the  superprocessor.  All  the  floating  point  operations  are 
confined  to  this  part  of  the  processor.  They  don’t  need  to  be  fast,  because 
we  can  keep  the  multipliers  and  adders  100%  occupied  by  working  on  several 
different  triangularizations  at  the  same  time. 

The Matrix Triangularization Problem


A common task in adaptive signal processing is as follows:
We have a set of N training vectors, each with M components.
These constitute a matrix X and we need the Cholesky factor,
T, of its correlation matrix $R = XX^{H} + \lambda I$.

Usually $N \gg M$.

The cost of the computation is of the order of $M^{2}N$. If $M^{2}N$
is large, we will need some parallel computation to keep up
with a real-time requirement.
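
As a rough check on the $M^{2}N$ figure, the sketch below counts length-N multiply passes for a Householder triangularization. It assumes that eliminating column i needs one dot product for each of the M-i+1 remaining columns and one scaled subtraction for each of the M-i later columns. The function name and the example sizes are illustrative choices, and scalar work such as the square root is ignored.

```python
def householder_multiply_count(n_rows, m_cols):
    """Count multiplies in an M-step Householder triangularization.

    Assumed accounting: each dot product and each scaled subtraction
    costs n_rows multiplies; scalar operations are not counted.
    """
    total = 0
    for i in range(1, m_cols + 1):
        dot_products = m_cols - i + 1  # column i with itself and each later column
        row_updates = m_cols - i       # one scaled subtraction per later column
        total += n_rows * (dot_products + row_updates)
    return total


print(householder_multiply_count(100, 20))  # 40000, which equals 20**2 * 100
```

Under this accounting the total is exactly $M^{2}N$, because the sum of $2(M-i)+1$ over $i = 1, \ldots, M$ is $M^{2}$.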

The Matrix Triangularization Problem


A  common  approach  is  to  premultiply  the  N  by  M  matrix 
X  by  each  of  a  sequence  of  Householder  matrices, 
one  after  the  other.  Most  of  the  operations  required  are 
adds  and  multiplies,  and  it  is  straightforward  to  perform 
many  adds  and  multiplies  in  parallel,  but  the  algorithm 
also  requires  a  few  divisions  and  square  roots.  These 
interfere  with  the  efficiency  of  the  use  of  a  parallel 
array  of  multipliers  and  adders. 

The Matrix Triangularization Problem


This talk is about an architecture that might be suitable
for realization in FPGAs. We have in mind problems with
$M \approx 20$ and $N \approx 100$.

FPGAs are now available with approximately 100 built-in
multipliers and with the capability to create a similar
number of adders. Hence about ten FPGAs should be able to
perform about 1000 multiply-adds in parallel.

Our architecture should use these 100 multipliers and adders
with near 100% efficiency, and we desire that all the FPGAs
be identical (and, indeed, they might later be replaced by custom
ASICs).

Steps to triangularization
using unitary matrix premultiplications

Steps to triangularization
with diagonal loading
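
The Math of Zeroing a Column


$\phi = \sum_i \mathrm{conj}(x_i)\, x_i$,  $\Phi_j = \sum_i \mathrm{conj}(x_i)\, x_{ij}$   (N operations per column)

$\xi = \sqrt{\phi + \lambda}$,  $\mu = 1/\xi$

$\theta = \dfrac{1}{\xi(\xi - \sqrt{\lambda})}$

$\rho_j = \theta \Phi_j$   (1 operation per column)
$x_{ij} - \rho_j x_i$   (N operations per column)
$\mu \Phi_j$   (1 operation per column)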


999999-7 
XYZ  10/15/2004 


MIT  Lincoln  Laboratory 


The Math of Zeroing a Column
Opportunities for parallelism

N multiplications at once
$\mathrm{conj}(x_i) \cdot x_{ij}$;  $i = 1, \ldots, N$
(times the number of columns)

N multiplications at once
$\rho_j \cdot x_i$;  $i = 1, \ldots, N$
(times the number of columns minus 1)
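
The column-zeroing arithmetic can be sketched in plain Python as below. This is a minimal illustration, not the hardware data flow. It assumes X is stored as N rows by M columns, that the column being zeroed carries $\sqrt{\lambda}$ above its N data elements, and that the factor is returned as an upper-triangular T with $T^{H}T = X^{H}X + \lambda I$. The function name `triangularize` and the variable names are hypothetical.

```python
import math


def triangularize(x_rows, lam):
    """Return upper-triangular T (list of rows) with T^H T = X^H X + lam * I.

    x_rows holds N rows of M numbers. Assumes lam >= 0 and that every
    column still has nonzero energy when it is eliminated.
    """
    n, m = len(x_rows), len(x_rows[0])
    x = [[complex(v) for v in row] for row in x_rows]  # working copy
    t = [[0j] * m for _ in range(m)]
    root_lam = math.sqrt(lam)
    for c in range(m):
        # phi: energy of the column being zeroed (N multiplies).
        phi = sum(abs(x[i][c]) ** 2 for i in range(n))
        xi = math.sqrt(phi + lam)             # diagonal element of T
        mu = 1.0 / xi
        theta = 1.0 / (xi * (xi - root_lam))  # Householder scale factor
        t[c][c] = complex(xi)
        for j in range(c + 1, m):
            # Phi_j: dot product of column c with column j (N multiplies).
            big_phi = sum(x[i][c].conjugate() * x[i][j] for i in range(n))
            t[c][j] = mu * big_phi            # new element of row c of T
            rho = theta * big_phi
            for i in range(n):                # N multiply-subtracts
                x[i][j] -= rho * x[i][c]
    return t


T = triangularize([[1, 2], [3, 4], [5, 6]], 1.0)
print(T[0][0])  # (6+0j), because 1 + 9 + 25 + 1 = 36
```

Only the two inner loops over `i` scale with N; the square root and the two divisions occur once per eliminated column, which is why they can be computed slowly.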

Front End Processing
$\Phi$-processor computes $\Phi_j$

Skewed Front End Processing
Computes $\Phi_j$

Series-parallel Front End Processing

Computing $\xi$ and $\theta$


$\xi$: saved as part of the answer

$\theta$: used in output processor

• These are needed before output processing can begin.

• They are relatively complicated computations and will be
computed slowly.

• Our aim is to organize the algorithm so that the slow
computation of $\xi$ and $\theta$ can be buried.


$\xi = \sqrt{\phi + \lambda}$

$\theta = \dfrac{1}{\xi(\xi - \sqrt{\lambda})}$




Output processor


For each column

Compute and broadcast to all multipliers

$x_{ij} - \rho_j x_i$

$\mu \Phi_j$

One complex multiply per element.
The first column was saved and the
current column streams by.

Output Processor

Overview of the column elimination


• There are M columns. The process that eliminates column i
accepts M-i+1 columns in their normal order and spits out
M-i columns.

• Elements move only horizontally and are involved in
arithmetic only with other elements on the same horizontal
level.

• Sums propagate upward: $\xi$, $\theta$, and $\rho_j$ are computed at the
upper edge of the processor.

• $\rho_j$ must travel upward from the bottom edge to where it is
needed by a multiplier.

Many Processors


• Let us define a virtual super-processor. Its job is to zero out one
column.

• Let $\tau$ be the time separation between successive columns
presented to the processor input.

• $\tau$ is also the time required to do the multiplies needed for $\Phi_j$ and
the time required to multiply x by $\rho_j$.

• Then the time that column j spends inside the virtual super-processor
is $K\tau$, and most of this is waiting for the computation of
$\xi$, $\theta$, and $\rho_j$. (We’ll determine K later.)

• We desire that the virtual super-processor whose job is to zero
out column i+1 be ready for column j as soon as it is computed by
the previous virtual super-processor.

When do columns get where?


virtual super-processor   first column enters      column j enters        last column enters
1                         $0$                      $(j-1)\tau$            $(M-1)\tau$
2                         $(K+1)\tau$              $(K+j-1)\tau$          $(K+M-1)\tau$
3                         $(2K+2)\tau$             $(2K+j-1)\tau$         $(2K+M-1)\tau$
i                         $((i-1)K+(i-1))\tau$     $((i-1)K+j-1)\tau$     $((i-1)K+M-1)\tau$
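
The rows of this table follow one pattern, which the sketch below encodes in units of $\tau$. The helper names are hypothetical; the formula is the one in the row for virtual super-processor i, and the example values M = 20 and K = 10 are illustrative.

```python
def column_entry_time(i, j, k):
    """Time, in units of tau, at which column j enters virtual super-processor i."""
    return (i - 1) * k + (j - 1)


def first_and_last_entry(i, m_cols, k):
    """Entry times of the first column (column i) and the last column (column M)."""
    return column_entry_time(i, i, k), column_entry_time(i, m_cols, k)


print(first_and_last_entry(3, 20, 10))  # (22, 39)
```

For any column j, the entry times at super-processors i and i+1 differ by exactly K, which matches the statement that a column spends $K\tau$ inside each virtual super-processor.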

Super-processor sharing


virtual super-processor   first column enters      column j enters        last column enters
i                         $((i-1)K+(i-1))\tau$     $((i-1)K+j-1)\tau$     $((i-1)K+M-1)\tau$
M+1-i                     $((M-i)K+(M-i))\tau$     $((M-i)K+j-1)\tau$     $((M-i)K+M-1)\tau$

These two virtual super-processors together process M+1
columns, independent of i, so it is tempting to combine them
into one actual super-processor. M/2 actual super-processors
are needed for the whole triangularization.
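
The pairing can be checked with a few lines of Python. This sketch assumes M is even and uses the earlier statement that the process eliminating column i accepts M-i+1 columns; the function names are hypothetical.

```python
def columns_accepted(i, m_cols):
    """Number of columns that virtual super-processor i accepts."""
    return m_cols - i + 1


def paired_loads(m_cols):
    """(i, partner, total columns) for each pair of virtual super-processors."""
    pairs = []
    for i in range(1, m_cols // 2 + 1):
        partner = m_cols + 1 - i
        total = columns_accepted(i, m_cols) + columns_accepted(partner, m_cols)
        pairs.append((i, partner, total))
    return pairs


for i, partner, total in paired_loads(20):
    print(i, partner, total)  # the third number is 21 on every line
```

With M = 20 there are ten pairs, and each pair handles M+1 = 21 columns per triangularization.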

Make Super-processors Identical


virtual super-processor   first column enters      column j enters        last column enters
i                         $((i-1)K+(i-1))\tau$     $((i-1)K+j-1)\tau$     $((i-1)K+M-1)\tau$
M+1-i                     $((M-i)K+(M-i))\tau$     $((M-i)K+j-1)\tau$     $((M-i)K+M-1)\tau$

When actual super-processor i accepts its last column from actual super-processor i-1, in
the next interval it is ready to accept the first column from actual super-processor i+1, but
that must be from an earlier triangularization problem. The M/2 super-processors begin a
new triangularization problem every $(M+1)\tau$.

If that column is to be ready, we require

$(i-1)K + M \equiv (M-i)(K+1) \pmod{M+1}$

or

$(2K+1)i \equiv (M+1)K \equiv 0 \pmod{M+1}$.

So we choose K to make $2K+1 = M+1$, that is, $K = M/2$.
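
The congruence can be verified numerically. The sketch below is a check under the assumption that M is even, so that K = M/2 is an integer. It tests whether the slot after the last column of virtual super-processor i, at $((i-1)K + M)\tau$, coincides modulo M+1 with the first-column time of virtual super-processor M+1-i, at $(M-i)(K+1)\tau$. The function name is hypothetical.

```python
def handoff_is_aligned(m_cols, k):
    """True if the two entry times agree modulo M+1 for every i."""
    period = m_cols + 1
    for i in range(1, m_cols + 1):
        free_slot = (i - 1) * k + m_cols        # slot after the last column
        partner_first = (m_cols - i) * (k + 1)  # first column of the partner
        # The difference equals (2K+1)*i - (M+1)*K.
        if (free_slot - partner_first) % period != 0:
            return False
    return True


print(handoff_is_aligned(20, 10))  # True
print(handoff_is_aligned(20, 9))   # False
```

With K = M/2 the difference is $(M+1)(i-K)$, a multiple of M+1 for every i, so the same choice of K serves every super-processor.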

Column Timing

A Dose of Reality


• Problems with word length and scaling
• Problems with input and output

Problems with word length
and scaling


The N elements of input column i have the same total energy
as the i elements of the final output for that column, so some element
might have dynamic range expansion of up to $\sqrt{N}$.

So we might need floating point. (This is not a result of the
architecture; it is intrinsic to the problem.) FPGAs come with
efficient built-in multipliers, but not built-in floating
point. We don’t know how many floating point multipliers
and adders we can get in a single FPGA.
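
To put a number on the growth, the sketch below converts the worst-case factor into extra bits of word length. It assumes the extreme case in which the energy of N equal-magnitude inputs ends up in a single output element; the function name and the example N = 100 are illustrative.

```python
import math


def dynamic_range_growth(n_samples):
    """Worst-case magnitude growth and the extra bits needed to hold it."""
    factor = math.sqrt(n_samples)  # N * a**2 of energy in one element gives sqrt(N) * a
    return factor, math.log2(factor)


factor, bits = dynamic_range_growth(100)
print(factor, round(bits, 2))  # 10.0 3.32
```

For N near 100 this is a little over three extra bits in a fixed-point word, on top of whatever the input data already require.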

Problems with input and output


Our  architecture  has  negligible  internal  control,  but  requires 
that  data  arrive  from  multiple  problems  at  just  the  right 
time,  including  skewing. 

Several  problems  are  active  at  once  and  late  t-elements 
from  one  problem  get  delivered  to  the  customer  after  the  early 
t-elements  from  later  problems. 

So  we  will  need  an  interface  that  transfers  data  for 
several  “customers”  to  and  from  the  processing  array. 

Summary


We’ve  presented  principles  for  an  architecture  suitable  for  realizing 
matrix  triangularization  with  highly  parallel  use  of  multipliers  and 
adders.  Identical  parts  are  used  and  internal  control  is  negligible. 

Parallelism  comes  from  working  on  many  independent  problems 
at  once.  The  waiting  time  for  square  roots  and  divisions  is  buried 
and  does  not  reduce  the  efficiency  of  the  use  of  multipliers. 

The  architecture  will  only  become  practical  when  FPGAs  can 
realize  large  numbers  of  floating  point  adders  and  multipliers. 